Spring R ’24
Dominic Bordelon, Research Data Librarian
University Library System, University of Pittsburgh
dbordelon@pitt.edu
Services for the Pitt community:
Support areas and interests:
| # | Date | Title |
|---|---|---|
| 1 | 2/22 | Getting Started with Tabular Data |
| 2 ⭐ | 2/29 | Working with Data Frames |
| 3 | 3/7 | Data Visualization |
| 4 | 3/21 | Inference and Modeling Intro |
| 5 | 3/28 | Machine Learning Intro |
Variable names are returned by names(df), and a variable may be accessed by df$variable_name (where df is the data frame of interest)
Process diagram of the Cross-industry standard process for data mining (CRISP-DM). Image credit: Kenneth Jensen, CC BY-SA 3.0, via Wikimedia Commons.
The functionality we’re learning about today will enable us to:
Anecdotally, “everyone is interested in modeling, but 90% of the work is in the prerequisite Business Understanding, Data Understanding, and Data Preparation.”
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
The goal of tidyr is to help you create tidy data. Tidy data is data where:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
library(palmerpenguins)
# load palmerpenguins' data into your environment:
data(penguins)
names(penguins)
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
palmerpenguins is one of many examples of an R package which functions as a downloadable data set.
Pipe operator %>%
- %>% (percent greater-than percent)
- exprA %>% exprB evaluates expression A, and then sends its output to expression B as input
Note that the first argument of round() has disappeared in the piped version, because it is filled by the mean just calculated. 1 is the digits argument, i.e., one decimal place.
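The comparison described above can be sketched as follows. The vector x holds made-up values for illustration only, and dplyr is loaded here simply because it re-exports %>% from magrittr.

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

# hypothetical measurements, for illustration only
x <- c(2.34, 5.11, 3.78)

# nested version: read from the inside out
nested <- round(mean(x), 1)

# piped version: read left to right;
# round() receives the mean as its first argument, and 1 is digits
piped <- x %>% mean() %>% round(1)

# both versions give the same value
identical(nested, piped)
```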
We can string together as many piped expressions as we want. For example, this code calculates temperature changes from three experimental trials, averages and rounds them, then prints a short statement:
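A minimal sketch of such a pipeline, assuming made-up start and end temperatures for the three trials (t_start and t_end are hypothetical names, not from the original exercise):

```r
library(dplyr)

# hypothetical start and end temperatures (degrees C) for three trials
t_start <- c(20.1, 19.8, 20.4)
t_end   <- c(25.3, 24.9, 26.0)

# difference per trial, then average, round, and build a short statement
statement <- (t_end - t_start) %>%
  mean() %>%
  round(1) %>%
  paste("degrees: average temperature change")

print(statement)
```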
Like other expressions, a pipeline can have its result assigned to an object.
Revising the same example: the value is calculated and assigned in its own pipe (as delta_t_avg), then combined with text afterwards.
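One way this revision might look, again with hypothetical trial temperatures; only the name delta_t_avg comes from the text above.

```r
library(dplyr)

# hypothetical start and end temperatures (degrees C) for three trials
t_start <- c(20.1, 19.8, 20.4)
t_end   <- c(25.3, 24.9, 26.0)

# the pipeline's result is assigned to an object...
delta_t_avg <- (t_end - t_start) %>%
  mean() %>%
  round(1)

# ...and combined with text afterwards
statement <- paste("The average temperature change was", delta_t_avg, "degrees.")
print(statement)
```

Storing the value separately keeps it available for later calculations, not only for printing.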
| Function | Windows | macOS |
|---|---|---|
| Execute line | Ctrl-Enter | ⌘-Enter |
| Assignment operator <- | Alt - (Alt-hyphen) | ⌥ - (Option-hyphen) |
| Pipe operator %>% | Ctrl-Shift-M | ⌘-Shift-M |
Pipe operator
ex2.1-pipe-operator.qmd
Practice…
💡 Ctrl-Shift-M / ⌘-Shift-M inserts %>% (pipe)
For cases (rows):
- filter() returns cases which match 1+ logical condition(s)
- arrange() returns cases sorted according to 1+ variable(s)
For variables (columns):
- select() returns only certain variables of the data
- rename() renames variables
- mutate() creates new variables
Artwork by @allison_horst (CC BY 4.0)
filter() subsets cases
- filter(df, ...) where df is the data frame and ... are 1+ logical expressions; each case that tests TRUE to the condition(s) will appear in the output
- A logical expression returns TRUE or FALSE when evaluated

| Syntax | Example(s) | Name | Notes |
|---|---|---|---|
| < <= > >= == | | Comparators | == means “equals” and works with both numeric and character data. |
| %in% | | Membership (or “in”) operator | Asks, “is the value found in vector \(y\)?” |
| is.na() | | is.na | Asks, “is the value missing?” Missingness is represented in R with NA (see next slide) |
| & \| | age < 34 & age >= 25 | Boolean AND and OR operators | Combine logical expressions |
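To see what each kind of logical expression in the table returns, here is a small sketch on made-up vectors (age and city are hypothetical and not part of the workshop data):

```r
# hypothetical values, for illustration only
age  <- c(22, 25, 30, 34, NA)
city <- c("Pittsburgh", "Erie", "Altoona", "Pittsburgh", "Erie")

is_25_plus  <- age >= 25                        # comparator on numeric data
is_erie     <- city == "Erie"                   # == also works on character data
in_set      <- city %in% c("Erie", "Altoona")   # membership operator
age_missing <- is.na(age)                       # TRUE where the value is NA
in_range    <- age < 34 & age >= 25             # Boolean AND of two comparisons

# each result is a logical vector with one element per case
in_range
```

Inside filter(), only cases where the expression is TRUE are kept; cases that evaluate to NA are dropped along with the FALSE ones.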
Missing data or the concept of missingness is represented in R with the symbol NA (not the string "NA" in quotation marks), for “Not Available”. It is equivalent to an empty cell in Excel.
What NA can mean (depending on context)
Consequences of NA
- You can't do arithmetic with NA (NA \(\neq 0\)), or make number-line comparisons
- NA must be removed for arithmetic such as mean(); see na.rm argument and similar for many functions
- Cases with NA for a variable of interest may need to be dropped or imputed for analysis
filter() 🐧
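A sketch of filter() on the penguins data, including the NA behavior noted above. The 4000 g cutoff and the object names are arbitrary choices for illustration.

```r
library(dplyr)
library(palmerpenguins)

# NA propagates through arithmetic and comparisons
NA + 1
NA > 1
mass_mean_raw   <- mean(penguins$body_mass_g)                # NA: some masses are missing
mass_mean_clean <- mean(penguins$body_mass_g, na.rm = TRUE)  # missing values dropped first

# cases on Dream island with body mass above 4000 g;
# cases whose mass is NA do not test TRUE, so filter() drops them
heavy_dream <- penguins %>%
  filter(island == "Dream" & body_mass_g > 4000)

# cases where sex is missing
sex_missing <- penguins %>%
  filter(is.na(sex))

# cases where sex is known
sex_known <- penguins %>%
  filter(!is.na(sex))
```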
arrange() sorts cases
- arrange(df, ...) returns df with cases sorted according to one or more variables
- Wrap a variable in desc() to sort in descending order
select() and rename() work on variables
- select(df, v1, v2, v3), where v1 etc. are variable names, is for selecting which variables you want to retain in the data frame.
- Use minus ( - ) to negate a column (i.e., “all variables except…”)
- rename(df, new_name = old_name) renames a variable old_name to new_name
mutate() creates a new variable in the data frame
- mutate(df, new_variable = expr) creates new_variable in df
- expr may be: a mathematical expression, a function call, a vector of appropriate length, or a fixed value. expr may also implement “if-then” logic, using one or more variables of the same case (e.g., “If temperature is above 90, then heat category is High”).
# isolate body masses, then convert penguin mass from g to kg and lbs:
penguins %>%
  select(body_mass_g) %>%
  mutate(body_mass_kg = body_mass_g / 1000,
         body_mass_lbs = body_mass_g / 453.6)
# to store the result back to penguins:
penguins <- penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000,
         body_mass_lbs = body_mass_g / 453.6)
A common mutate() task is to convert an existing variable’s type or measurement units. For example, our penguins_raw$Island variable is encoded as character data, but we would like to convert it to a categorical variable. In R, a categorical variable is encoded in a factor, a vector which accepts only certain values. Factors may be ordered.
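The verbs described above that have no example yet (arrange(), select(), rename()), plus the factor conversion and “if-then” logic for mutate(), might be sketched like this. The new column names, the 4500 g cutoff, and the use of case_when() for the if-then step are illustrative assumptions, not the workshop's own code.

```r
library(dplyr)
library(palmerpenguins)

# heaviest penguins first, keeping and renaming a few variables
heaviest <- penguins %>%
  arrange(desc(body_mass_g)) %>%
  select(species, island, body_mass_g) %>%
  rename(mass_g = body_mass_g)

# all variables except year
no_year <- penguins %>%
  select(-year)

# convert the character variable Island in penguins_raw to a factor
penguins_raw_fct <- penguins_raw %>%
  mutate(Island = as.factor(Island))

# "if-then" logic with case_when(); the 4500 g cutoff is arbitrary
penguins_sized <- penguins %>%
  mutate(size_class = case_when(
    body_mass_g >= 4500 ~ "large",
    body_mass_g <  4500 ~ "small"   # cases with NA mass match neither and stay NA
  ))
```

levels(penguins_raw_fct$Island) lists the values the new factor accepts.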
summarize() returns a single row of summary calculations
- Center: mean(), median()
- Spread: sd(), IQR()
- Range: min(), max()
- Position: first(), last(), nth()
- Count: n(), n_distinct()
- Logical: any(), all()
summarize()
The glimpse() function is used because its compact vertical view is perfect for a single-row table.
# median, mean, and SD of body mass:
penguins %>%
  summarize(count = n(),
            median_mass = median(body_mass_g, na.rm = TRUE),
            mean_mass = mean(body_mass_g, na.rm = TRUE),
            sd_mass = sd(body_mass_g, na.rm = TRUE)) %>%
  glimpse()
# mean of each numeric variable
# (where() wraps the predicate; the lambda passes na.rm to mean()):
penguins %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  glimpse()
group_by() creates groupings using a variable
- group_by(v1) groups cases in the data according to their value for the variable (factor) v1.
- summarize() understands these groups and applies function calls to the groups, rather than the whole data set, returning one row for each group
- group_by(v1, v2) groups cases by v1 and then v2 (order does matter)
group_by()
# mean and sd body mass of each observed species:
penguins %>%
  group_by(species) %>%
  summarize(n = n(),
            mean_mass = mean(body_mass_g, na.rm = TRUE),
            sd_mass = sd(body_mass_g, na.rm = TRUE)) %>%
  glimpse()
# mean of each numeric variable for each species and sex:
penguins %>%
  group_by(species, sex) %>%
  summarize(n = n(),
            across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
# note groups for sex == NA
Data frame manipulation
ex2.2-data-frame.qmd
Practice…
Illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
pivot_longer() collapses 3+ columns into two…creating more rows (a “longer” data frame)
pivot_longer(cols, names_to, values_to) where
- cols is a vector of variable names (or a tidyselect selection, such as starts_with()), whose columns you want to collapse
- names_to is the name of a new variable which will receive the names of the collapsing columns
- values_to is the name of a new variable which will receive the values of the collapsing columns
pivot_longer() example: religion and income
tidyr’s relig_income dataset, from a Pew religion and income survey, has 1 row per religion and a column for the count of people in each income category. Let’s treat each count as its own observation. (This and the following example are from the tidyr Pivoting vignette, also available by running vignette("pivot").)
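A sketch of the reshaping just described, following the tidyr Pivoting vignette; the object name relig_long is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# collapse every column except religion into income/count pairs
relig_long <- relig_income %>%
  pivot_longer(cols = !religion,
               names_to = "income",
               values_to = "count")

# one row per religion and income category
glimpse(relig_long)
```

Each religion now appears once per income category, so the row count is the original number of rows times the number of collapsed columns.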
pivot_wider() expands two columns into more…creating more variables (a “wider” data frame)
pivot_wider(names_from, values_from) where
- names_from is the name of a variable whose values will form the names of the new variables
- values_from is the name of a variable whose values will form the values of the new variables, corresponding to the appropriate name
pivot_wider() example: fish encounters
It’s relatively rare to need pivot_wider() to make tidy data, but it’s often useful for creating summary tables for presentation, or data in a format needed by other tools. The fish_encounters dataset, contributed by Myfanwy Johnston, describes when fish swimming down a river are detected by automatic monitoring stations. Many tools used to analyse this data need it in a form where each station is a column.
data(fish_encounters)
names(fish_encounters)
# sample 5 random rows:
slice_sample(fish_encounters, n = 5)
fish_encounters_wide <- fish_encounters %>%
  pivot_wider(names_from = station,
              values_from = seen)
# use glimpse() for previewing a wide table
slice_sample(fish_encounters_wide, n = 5) %>%
  glimpse()
Data of interest often “live” in more than one table.
Detail of a relational database diagram, relating Patients records to their Diagnoses and Medications. Image credit: Tsedenjav.Sh, CC BY-SA 4.0, via Wikimedia Commons.
left_join() treats the left table as primary
- left_join(x, y, by) where x is the left table, y is the right table, and by is the variable name by/on which to join
- All rows of x will be retained in the output, whether they match in y or not
- Matching values are added from y for each row of x
left_join() example: penguin species
Let us supplement our penguin observations with a table of information about the penguin species. Note that not all of our observed species in x exist in y. Note also that there is a species in y that we have not observed in x.
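A minimal sketch of this situation, using small hand-built tables rather than the workshop's own data. The tables observations and species_info and their columns are hypothetical stand-ins for x and y. The inner and full joins are included only to compare row counts.

```r
library(dplyr)

# x: a few hypothetical penguin observations
observations <- tibble(
  id      = 1:4,
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie")
)

# y: species information; Chinstrap is absent, and Emperor was never observed
species_info <- tibble(
  species = c("Adelie", "Gentoo", "Emperor"),
  genus   = c("Pygoscelis", "Pygoscelis", "Aptenodytes")
)

# every row of x is kept; the Chinstrap row gets NA for genus,
# and Emperor does not appear because it has no match in x
joined <- left_join(observations, species_info, by = "species")

# for comparison: only matching rows, and all rows from both tables
inner <- inner_join(observations, species_info, by = "species")
full  <- full_join(observations, species_info, by = "species")

c(left = nrow(joined), inner = nrow(inner), full = nrow(full))
```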
- right_join() treats the right df as the primary table, keeping all its rows
- inner_join() returns the minimal set, because it requires values in both tables
- full_join() returns the maximal set, keeping all rows from both df’s
Today we learned about:
Join us next week for data visualization!
R 2: Working with Data Frames